\chapterbegin Chapter 8. The Characters\\You Type

A lot of different keyboards are used with \TeX, but few keyboards can
produce 128 different symbols. Furthermore, as we have seen, some of the
characters that you {\sl can\/} type on your keyboard are reserved for
special purposes like escaping and grouping. Yet when we studied fonts it
was pointed out that there are 256 characters per font. So how can you
refer to the characters that aren't on your keyboard, or that have been
pre-empted for formatting?

One answer is to use control sequences. For example, the plain format
of Appendix B\null, which defines |%| to be an end-of-line symbol so that you
can use it for comments, also defines the control sequence |\%| to mean
a ↑{percent sign}.

To get access to any character whatsoever, you can type
$$\dbox{|\char|\<number>\hss}$$
where \<number> is any number from 0 to 255 (optionally followed by a space);
you will get the corresponding character from the current font. That's how
Appendix@B handles |\%|; it defines `|\%|' to be an abbreviation for
`|\char37|\vspace', since 37 is the character code for a percent sign.

The codes that \TeX\ uses internally to represent characters are based on
``↑{ascii},'' the American Standard Code for Information Interchange.
Appendix@C gives full details of this code, which assigns numbers to
certain control functions as well as to ordinary letters and punctuation
marks.  For example, ↑{<space}${}=32$ and ↑{<carriage-return}${}=13$.
There are 94@standard visible symbols, and they have been assigned code
numbers from 33 to@126, inclusive.

It turns out that `|b|' is character number 98 in ascii. So you can
typeset the word |bubble| in a strange way by putting
\ttbegin
\char98 u\char98\char98 le
\ttend
into your manuscript, if the |b|-key on your typewriter is out of order. \
(Of course you need the |\|, |c|, |h|, |a|, and |r| keys to type `↑{*char}',
so let's hope that they are always working.)

\danger \TeX\ always uses the internal character code of Appendix@C
for the standard ascii characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ↑{EBCDIC} or some other non-ascii scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.
Device-independent (↑{.dvi}) output files use \TeX's internal code. In
this way, \TeX\ is able to give identical results on all computers.

\danger Character code tables like those in Appendix@C often give the code
numbers in {\sl ↑{octal notation}}, i.e., the radix-8 number system, in which
the digits are {\sl0},@{\sl1}, {\sl2}, {\sl3}, {\sl4}, {\sl5}, {\sl6},
and@{\sl7}.\footnote*{The author of this manual likes to use italic digits
for octal numbers, and typewriter type for hexadecimal numbers, in order
to provide a typographic clue to the underlying radix whenever possible.}
Sometimes {\sl↑{hexadecimal notation}\/} is also used, in which case the
digits are |0|,@|1|, |2|, |3|, |4|, |5|, |6|, |7|, |8|, |9|, |A|, |B|, |C|,
|D|, |E|, and@|F|. For example, the octal code for `|b|' is {\it142}, and
its hexadecimal code is |62|. A ↑{<number} in \TeX's language can begin
with@a@|'|, when it is understood as octal, or with a |"|, when it is
understood as hexadecimal. Thus, |\char'142| and |\char"62| are equivalent
to |\char98|. The legitimate character codes in octal notation run from
\oct0 to \oct{377}; in hexadecimal, they run from \hex0 to \hex{FF}.
↑(apostrophe)↑(doublequote)
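
\danger Such conversions can be checked by hand, expanding each digit by
its place value: $\hbox{\sl142}=1\times64+4\times8+2=98$, and
$\hbox{\tt62}=6\times16+2=98$. The same reasoning applies to any other
character. For instance, assuming the standard ascii value 65 for `|A|'
(a value not stated in this chapter), each of the three lines
\ttbegin
\char65
\char'101
\char"41
\ttend
should typeset an `A', since $\hbox{\sl101}=64+1$ and
$\hbox{\tt41}=4\times16+1$.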

\danger But \TeX\ actually provides another kind of \<number> that makes it
unnecessary for you to know ascii at all! The token |`|$↓{12}$, when followed
by any character token or by any control sequence that corresponds to a
single character, stands for \TeX's internal code for the character in
question. For example, |\char`b| and |\char`\b| are also equivalent to
|\char98|. ↑(reverse apostrophe)
If you look in Appendix@B to see how |\%| is defined, you'll notice that
the definition is
\ttbegin
\def\%{\char`\%}
\ttend
instead of |\char37| as claimed above.

\dangerexercise What would be wrong with `|\def\%{\char`%}|'\thinspace?
\answer The |%| would be treated as a comment character, because its
category code is@14; thus, no |%| token or |}| token would get through
to the gullet of \TeX\ where numbers are treated. When a character is
of category 0, 5, 9, 14, or@15, the extra |\| must be used; and the
|\| doesn't hurt, so you can always use it to be safe.

\ddangerexercise The preface to this manual points out that you can
expect to discover little white lies from time to time. Well, if you actually
check Appendix@B you'll find that
\ttbegin
\let\%\relax \edef\%{\char`\%\space}
\ttend
is the true definition of |\%|. Why is this a good way to define it?
\answer \TeX\ removes an optional ↑{space} after each \<number>. The
definitions suggested earlier in this chapter would work in most cases,
since the user should expect a space to disappear after |\%| anyway.
The |\edef| in Appendix@B puts a space token after the |\%| token, and
this is better in certain situations. Consider, for example,
\hbox{|\def\a#1{\hbox{#1 }}|} followed by |\a\%|. \ (The `|\let\%\relax|'
in Appendix@B is necessary to avoid an ``undefined control sequence'' error
when the definition is being expanded. An interesting alternative would be
\ttbegin
\edef\%{\char\number`\%\space}
\ttend
which is equivalent to `|\def\%{\char37 }|' and needs no prior |\let|.)

Although you can use |\char| to access any character in the current
font, you can't use it in the middle of a control sequence. For example,
if you type
\ttbegin
\\char98
\ttend
\TeX\ reads this as the control sequence |\\| followed by |c|, |h|, |a|,
etc., not as the control sequence |\b|.

You will hardly ever need to use |\char| when typing a manuscript, since
the characters you want will probably be available as predefined control
sequences; |\char| is primarily intended for the designers of book formats
like those in the appendices. But some day you may require a ↑{special
symbol}, and you may have to hunt through a font catalog until you find
it. Once you find it, you can use it by simply selecting the appropriate
font and then specifying the character number with |\char|. For example,
the ``↑{dangerous bend}'' sign used in this manual appears as character
number@127 of font ↑{.cmathx}, and that font is selected by the control
sequence ↑{:tenex}. The macros in Appendix@E therefore display dangerous
bends by saying `|{\tenex\char127}|'.

We have observed that the ascii character set includes only 94 printable
symbols; but \TeX\ works internally with 128 different character codes,
from 0 to 127, each of which is assigned to one of the sixteen categories
described in Chapter@7. If your keyboard has additional symbols, or if it
doesn't have the standard@94, the people who installed your local \TeX\ system
can tell you the correspondence between what you type and the character
number that \TeX\ receives. Some people are fortunate enough to have keys
marked `{\tt\rlap/=}' and `{\tt\rlap<\char'32}' and `{\tt\rlap>\char'32}';
it is possible to install \TeX\ so that it will recognize these handy symbols
and make the typing of mathematics more pleasant. But if you do not have
such keys, you can get by with the control sequences ↑{:ne}, ↑{:le},
and ↑{:ge}. ↑(not-equal)↑(less-or-equal)↑(greater-or-equal)

\danger \TeX\ has a standard way to refer to the invisible characters of
ascii: Code@0 can be typed |↑↑@|, code@1 can be typed |↑↑A|, and so on up
to code@31, which is |↑↑_|; you use the characters |@|, |A|, $\ldotss$, @|_|
(whose ascii equivalents are 64 to 95) to get codes that differ by 64.
Also, code 127 can be typed |↑↑?|; the dangerous bend sign could therefore
be obtained by saying `|{\tenex↑↑?}|'. However, you must change
the category code of character 127 before using it, since this character
ordinarily has category@15 (invalid); say, e.g., `|\catcode`↑↑?=12|'.
↑(double caret)
The |↑↑| notation is different from |\char|, because |↑↑| combinations can
be used as if they were single characters; for example, it would not
be permissible to say |\catcode`\char127|, but |↑↑| symbols can even be
used as letters within control sequences.
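
\danger The arithmetic behind this notation is simple enough to tabulate.
Using the ascii codes cited above (|@| is 64 and |_| is 95), together with
the standard ascii values 77 for |M| and 63 for |?| (which are assumed
here rather than taken from this chapter), we have
$$\displayvbox{\halign{#\hfil\qquad&$#$\hfil\cr
|↑↑@|&64-64=0\cr
|↑↑M|&77-64=13\cr
|↑↑_|&95-64=31\cr
|↑↑?|&63+64=127\cr}}$$
Only the last case adds 64 instead of subtracting it.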

\danger One of the overfull box messages in Chapter 6 illustrates the fact
that \TeX\ sometimes uses the funny |↑↑| convention in its output:
the umlaut character in that example appears as |↑↑S|, and the cedilla appears
as@|↑↑X|, because `\thinspace\char'23\thinspace' and `\char'30' occur in
positions \oct{23} and@\oct{30} of the ↑{:tenrm} font.

\danger Most of the |↑↑| codes are unimportant except in special applications.
But |↑↑M| is particularly noteworthy because it is code 13, the ascii
↑{<carriage-return} that \TeX\ places at the right end of every line of
your input file. By changing the category of |↑↑M| you can obtain useful
special effects, as we shall see later.

\ddanger People who install \TeX\ systems for use with non-American alphabets
are advised to use character codes less than 32 for any additional letters,
and to assign category@11 (letter) to those codes. For example, suppose
you have a ↑{Norwegian keyboard} that contains the letter {\tt\ae}.
↑(Scandinavian letters) ↑(foreign languages)
You could design your \TeX\ interface so that this letter comes in as
code@28,\footnote*{There's nothing magic about this number 28, except that
by coincidence the Computer Modern fonts of plain \TeX\ happen to have
an `\ae' in position@28 already. Some change to the font layout is inevitable,
however, since all six of the special letters \ae, \o, \aa, \AE,
\O, and \AA\ should be assigned to positions less than 32. Characters
already in those positions can easily be moved to positions greater than
127, since they are never accessed by plain \TeX\ except via control
sequences.}  say, and your standard format package should define
|\catcode`|{\tt\ae}|=11|. Then you could have control sequences like
|\s|{\tt\ae}|rtrykk|; and your \TeX\ input files would be readable by
American installations of \TeX\ that don't have your keyboard, by
substituting |↑↑\| for character@28. \ (For example, the stated control
sequence would appear as |\s↑↑\rtrykk| in the file; your American
friends should also be provided with the format that you used, with its
|\catcode`↑↑\=11|.) \ Of course you should also arrange your fonts
so that \TeX's character 28 will print as \ae; and you should
change \TeX's hyphenation algorithm so that it will do correct
Norwegian hyphenation. The main point is that such changes are not
extremely difficult; nothing in the design of \TeX\ limits it to the
American alphabet, as long as you have at most 128 different characters.
↑(keyboards, non-ascii)

\danger But wait, you say. Why are characters numbered from 0 to@127,
when fonts can contain up to 256 different symbols? The answer is that
\TeX\ can access positions 128 to 255 of a font in several reasonably
convenient ways, even though its character tokens are coded from 0 to@127.
You can use |\char|, generally via a control sequence, as already
mentioned; and the higher positions of a font can conveniently be occupied
by math symbols, as we shall see later. Another important way to generate
codes above 127 is by sequences of keystrokes (i.e., ↑{ligatures}), when
the font has been set up properly. It is often faster to touch-type a
sequence of letters than to hunt for a single key on a large keyboard;
thus the restriction to 128 typable characters is not unreasonable.

\ddanger For example, let's consider Norwegian again, but suppose that you want
to use a keyboard without an {\tt\ae} character. You can arrange the ↑{font
metric file} so that \TeX\ will interpret `|ae|' as a ligature that
produces `\ae'; and you could put the character `\ae' in position 128
of the font. Similarly, you could define ligatures `|aa|' and `|o/|'
to produce `\aa' and `\o' in positions 129 and 130; and there would
also be `|AE|', `|AA|', and `|O/|' for `\AE', `\AA', and `\O' in positions
131 to 133. By setting |\catcode`/=11| you would be able
to use the ligature |o/| in control sequences like `|\ho/yre|'.
\TeX's hyphenation method is not confused by ligatures; so you could use
this scheme to operate essentially as above, but with two keystrokes
in place of one. \ (Your typists would have to watch out for
the occasional times when the adjacent characters |ae|, |aa|, and |o/| should
not be treated as ligatures.)

\danger The rest of this chapter is devoted to \TeX's reading rules,
which define the conversion from text to tokens. For example, the fact
that \TeX\ ignores spaces after control sequences is a consequence of
the rules below, which imply among other things that spaces after control
sequences never become space tokens. The rules are intended to work the
way you would expect them to, so you may not wish to bother reading them;
but when you are communicating with a computer, it is nice to understand
what the machine thinks it is doing, and here's your chance.

\danger Whenever \TeX\ is reading a line of text from a file, or a line of
text that you entered directly on your terminal, the reading apparatus is
in one of three so-called ↑{states}:
$$\displayvbox{\halign{State $#$\qquad\hfil&#\hfil\cr
N&Beginning a new line;\cr
M&Middle of a line;\cr
S&Skipping blanks.\cr}}$$
At the beginning of the line it's in state $N$, but most of the time it's
in state $M$, and after a control sequence or a space it's in state $S$.
Incidentally, ``states'' are different from the ``↑{modes}'' that we will
be studying later; the current {\sl state\/} refers to \TeX's eyes and
mouth as they take in characters of new text, but the current {\sl mode\/}
refers to the condition of \TeX's gastro-intestinal tract. Most of the
things that \TeX\ does when it converts characters to ↑{tokens} are independent
of the current state, but there are differences when spaces or end-of-line
characters are detected (categories 5 and 10).

\danger \TeX\ always inserts a ↑{<carriage-return} character (number 13)
at the right end of each line in a file, and at the end of each line read
from the terminal in response to `↑{.*}' or `↑{.**}'; but nothing
additional is placed at the end of lines that were inserted with `|I|'
during ↑{error recovery}.  Since it is possible to change the category of
a \<carriage-return>---you can even make it an escape character!---the
rules need to be spelled out precisely.

\danger If \TeX\ sees an escape character (category 0) in any state, it
scans the entire ↑{control sequence}, converts it to a control sequence
token, and goes to state $S$.  \TeX\ will next read the character
following the control sequence name, unless the name appears at the very
end of the line.  The process of scanning the entire control sequence means:
(a)@Look at the next character in the line. If there is no next character,
the control sequence name is empty (like ↑(null control sequence)↑(empty
control sequence)|\csname\endcsname|). Otherwise (b)@If the next character
is not of category@11 (letter), the control sequence name
consists of that single character.  Otherwise (c)@The control sequence
name consists of all letters beginning with the current one and ending
at the first nonletter, or at the end of the line.
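
\danger As a small illustration of steps (b) and@(c), assume the category
codes of plain \TeX, and consider the input line
\ttbegin
\hskip 3pt\%x
\ttend
The first escape character is followed by the letter |h|, so step@(c)
gathers the name |hskip|, stopping at the space; \TeX\ then goes to
state@$S$. The second escape character is followed by |%|, which is not a
letter, so step@(b) yields the one-character name |\%|. By the rules of
this chapter the whole line should therefore produce the tokens
|\hskip| |3|$↓{12}$ |p|$↓{11}$ |t|$↓{11}$ |\%| |x|$↓{11}$ \vspace$↓{10}$,
where the final space comes from the \<carriage-return>.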

\danger If \TeX\ sees a superscript character (category 7) in any state,
and if that character is followed by another identical character, and if
those two equal characters are followed by a character whose internal
code is between 63 and 95 inclusive, these three characters are replaced
by a single character, whose code is obtained by adding 64 to the code of
the third character if that code is 63, and by subtracting 64 otherwise.
(Thus, |↑↑A| is replaced by a character whose code is@1, etc., as explained
earlier in this chapter.) This replacement is carried out also if such a trio of
characters is encountered during steps (b) or@(c) of the control-sequence
scanning procedure described above. After the replacement is made, \TeX\
begins again as if the new character had been present all the time.
If a superscript character is not the first of such a trio, it is
handled by the following rule.

\danger If \TeX\ sees a character of categories 1, 2, 3, 4, 6, 8, 11, or@12,
or a character of category@7 that is not the first of a trio as just
described, it converts the character to a token by attaching the category
code, and goes into state@$M$. This is the normal case; almost every
nonblank character is handled by this rule.

\danger If \TeX\ sees an end-of-line character (category 5), it throws
away any other information that might remain on the current line. Then if
\TeX\ is in state@$N$ (new line), the end-of-line character is converted
to a control sequence token for `↑{*par}' (end of paragraph); if \TeX\ is
in state@$M$ (mid-line), the end-of-line character is converted to a token
for character@32 (`\vspace') of category@10 (↑{space}); and if \TeX\ is in
state@$S$ (skipping blanks), the end-of-line character is simply dropped.
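
\danger Here is a sketch of how this rule acts on three consecutive lines,
assuming the category codes of plain \TeX:
\ttbegin
one
two\relax
three
\ttend
The \<carriage-return> after `|one|' is seen in state@$M$, so it becomes
a space token. The one after `|\relax|' is seen in state@$S$, because
a control sequence has just been scanned, so it is dropped; thus no
space separates |\relax| from the `|t|' of `|three|'. If a completely
blank line were to follow, its \<carriage-return> would be seen in
state@$N$ and would become `|\par|'.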

\danger If \TeX\ sees a character to be ignored (category@9), it simply
passes that character and remains in the same state.

\danger If \TeX\ sees a character of category@10 (space), the action
depends on the current state. If \TeX\ is in state $N$ or $S$, the
character is simply passed by, and \TeX\ remains in the same state.
Otherwise \TeX\ is in state $M$; the character is converted to a token
of category@10 whose character code is@32, and \TeX\ enters state@$S$.
All spaces are made equal, because they tend to look equal when displayed.
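
\danger For example, under the category codes of plain \TeX, the line
`|   a   b|' should yield just |a|$↓{11}$ \vspace$↓{10}$ |b|$↓{11}$
\vspace$↓{10}$. The three leading spaces are passed by in state@$N$;
the first space after |a| is seen in state@$M$ and becomes a token;
the next two are passed by in state@$S$; and the final space token
comes from the \<carriage-return>, by the end-of-line rule.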

\danger If \TeX\ sees an active character (category 13), it converts the
character to a control sequence token and goes to state $M$. Control
sequences for active characters are independent of the control sequences
formed by an escape prefixed to a single character; e.g., |@| and
|\@| are distinct control sequences.

\danger If \TeX\ sees a comment character (category@14), it throws away that
character and any other information that might remain on the current line.

\danger Finally, if \TeX\ sees an invalid character (category@15),
it bypasses that character, prints an error message, and remains in the
same state.

\danger If \TeX\ has nothing more to read on the current line, it goes to
the next line (if any) and enters state $N$. A blank line is appended to
the end of every text file; this line has no characters, but a
\<carriage-return> is placed after it in the usual manner. Therefore
most files effectively end with `↑{*par}'.

\dangerexercise Test your understanding of \TeX's reading rules by answering
the following quickie questions: (a)@What is the difference between
categories 5 and@14? (b)@What is the difference between categories 3
and@4?  (c)@What is the difference between categories 11 and@12?  (d)@Are
spaces ignored after active characters?  (e)@When a line ends with a comment
character like |%|, are spaces ignored at the beginning of the next line?
(f)@Can an ignored character appear in the midst of a control sequence name?
\answer (a)@Both characters terminate the current line; but a character of
category@5 might be converted into a space token or a |\par| token, while
a character of category@14 never produces a token.  (b)@They produce
character tokens stamped with different category numbers.  For example,
|$|$↓3$ is not the same token as |$|$↓4$, so \TeX's digestive processes
will treat them differently.  (c)@Same as@(b), plus the fact that control
sequence names treat letters differently. It turns out that \TeX's
digestive processes treat categories 11 and 12 identically, except that
the category code is significant in the ↑{*ifcat} and ↑{*ifx} tests and
when looking for the end of a macro argument.  (d)@No. (e)@Yes; they're
ignored at the beginning of every line, since every line starts in
state@$N$. (f)@No.

\dangerexercise Look again at the error messages that appear near the
end of Chapter@6. When \TeX\ reported that |\vship| was an undefined
control sequence, it printed two lines of context, showing that
it was in the midst of reading line@2 of the |story| file. At the
time of that error message, what state was \TeX\ in? What character
was it about to read next?
\answer \TeX\ had just read the control sequence |\vship|, so it
was in state@$S$, and it was just ready to read the space before `|1in|'.
Afterwards it ignored that space, since it was in state@$S$; but if
you had typed |I\obeyspaces| in response to that error message,
you would have seen the space. Incidentally, when \TeX\ prints
the ↑{context of an error message}, the bottom pair of lines comes from
a text file, but the other pairs of lines are portions of token lists
that \TeX\ is reading (unless they begin with `|<*>|', when they
represent text inserted during ↑{error recovery}).

\dangerexercise Given the category codes of plain \TeX\ format,
what tokens are produced from the input line
`| $x↑2$@  \TeX  ↑↑C|'\thinspace?
\answer |$|$↓{3}$ |x|$↓{11}$ |↑|$↓7$ |2|$↓{12}$ |$|$↓{3}$ |@| \vspace$↓{10}$
|\TeX| |↑↑C|$↓{12}$ \vspace$↓{10}$. The final space comes from the
\<carriage-return> placed at the end of the line. The character code
for |↑↑C| is@3.

\dangerexercise Consider an input file that contains exactly
three lines; the first line says `|Hi!|', while the other two lines
are completely blank. What tokens are produced when \TeX\ reads
this file, using the category codes of plain \TeX\ format?
\answer |H|$↓{11}$ |i|$↓{11}$ |!|$↓{12}$ \vspace$↓{10}$ |\par|
|\par| |\par|. The `\vspace' comes from the \<carriage-return> at the
end of the first line; the second and third lines each contribute
a |\par|; and the final |\par| comes from the additional blank line
inserted by \TeX\ at the end of each input file.

\ddangerexercise Assume that the category codes of plain \TeX\ are in
force, except that the characters |↑↑A|, |↑↑B|, |↑↑C|, |↑↑M| belong
respectively to categories 0, 7, 10, and 11. What tokens are produced from
the (somewhat ridiculous) input line `|↑↑B↑↑BM↑↑A↑↑B↑↑M↑↑@↑↑C\M|'?
(Remember that this line is followed by \<carriage-return>, which is
|↑↑M|; and recall that |↑↑@| denotes the ↑{<null} character, which has
category@9 when |INITEX| begins.)
\answer The two |↑↑B|'s are not recognized as consecutive superscript
characters (sigh), so the result is seven tokens: |↑↑B|$↓7$ |↑↑B|$↓7$
|M|$↓{11}$ |\↑↑B| |↑↑M|$↓{11}$ \vspace$↓{10}$ |\M↑↑M|.

\ddanger Since it is possible to change the category codes, \TeX\ might
actually use several different categories for the same character on a single
line. For example, Appendix@E contains several ways to coerce \TeX\ to
process text ``↑{verbatim},'' so that the author could prepare this manual
without great difficulty. \ (Try to imagine typesetting a \TeX\ manual;
backslashes and other special characters need to switch back and forth
between their normal categories and category@12!) \ Some care is needed to
get the timing right, but you can make \TeX\ behave in a variety of
different ways by judiciously changing the categories. On the other hand,
it is best not to play with the category codes very often, because you must
remember that characters never change their categories once they have become
tokens.  For example, when the arguments to a macro are first scanned,
they are placed into a token list, so their categories are fixed once and
for all at that time.  The author has intentionally kept the category
codes numeric instead of mnemonic, in order to discourage people from
making extensive use of |\catcode| changes except in unusual
circumstances.
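
\ddanger Here is a small sketch of that pitfall; the macro name |\try| is
hypothetical, and the category codes of plain \TeX\ are assumed:
\ttbegin
\def\try#1{{\catcode`\%=12 #1}}
\try{50% off}
\ttend
The |\catcode| change comes too late. \TeX\ must scan the argument of
|\try| before the macro is expanded, and at that time |%| still has
category@14; hence the `| off}|' is thrown away as part of a comment,
and the argument continues onto the following lines. A change of
category affects only the characters that \TeX\ reads afterwards.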

\chapterend

for life's not a paragraph
\quad % he left a blank line here, really
And death i think is no parenthesis.
\author e. e. ↑{cummings}, {\sl since feeling is first\/} (1926)

\bigskip

This coded character set is to facilitate
the general interchange of information
among information processing systems,
communication systems, and
associated equipment.
$\ldots$ An 8-bit set was considered
but the need for more than 128 codes
in general applications was not yet evident.
\author ASA Subcommittee X3.2, ``American Standard Code\linebreak %
for Information Interchange,''\linebreak{↑(ascii)}%
in {\sl Communications of the ACM\/} (1963)

\eject